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Abstract:- A special moduli set Residue Number System (RNS) of high dynamic range (DR) can speed up the 
execution of very large word-length repetitive multiplications found in applications like public key cryptography. 
The modulo 2™ — 1 multiplier is usually the noncritical datapath among all modulo multipliers in such high-DR 
RNS multiplier. This timing slack can be exploited to reduce the system area and power consumption without 
compromising the system performance. With this precept, a family of radix-8 Booth encoded modulo 2™ — 1 
multipliers, with delay adaptable to the RNS multiplier delay, is proposed. The modulo 2™ — 1 multiplier delay is 
made scalable by controlling the word-length of the ripple carry adder, k employed for radix-8 hard multiple 
generation. Formal criteria for the selection of the adder word-length are established by analyzing the effect of 
varying k on the timing of multiplier components. It is proven that for a given n, there exist a number of feasible 
values of k such that the total bias incurred from the partially-redundant partial products can be counteracted by 
only a single constant binary string. This compensation constant for different valid combinations of n and k can be 
precomputed at design time using number theoretic properties of modulo 2 n — 1 arithmetic and hardwired as a 
partial product to be accumulated in the carry save adder tree. The adaptive delay of the proposed family of 
multipliers is corroborated by CMOS implementations. In an RNS multiplier, when the critical modulo multiplier 
delay is significantly greater than the noncritical modulo 2™ — 1 multiplier delay, k = n and k = - are 
recommended for n not divisible by three and divisible by three, respectively. Conversely, when this difference 
diminishes, k is better selected as n A and n /Aor n not divisible by three and divisible by three, respectively. 
Our synthesis results show that the proposed radix-8 Booth encoded modulo 2™ — 1 multiplier saves substantial 
area and power consumption over the radix-4 Booth encoded multiplier in medium to large word-length RNS 
multiplication. 

Index Terms:- Booth algorithm, design space exploration, modulo arithmetic, multiplier, residue number system 
(RNS). 

I. INTRODUCTION 

RIVEST, Shamir, and Adleman (RS A) and elliptic curve cryptography (ECC) are two of the most well established 
and widely used public key cryptographic (PKC) algorithms. The encryption and decryption of these PKC algorithms are 
performed by repeated modulo multiplications [l]-[3]. These multiplications differ from those encountered in signal 
processing and general computing applications in their sheer operand size. Key sizes in the range of 512-1024 bits and 
160-512 bits are typical in RSA and ECC, respectively [4]-[7]. Hence, the long carry propagation of large integer 
multiplication is the bottleneck in hardware implementation of PKC. The residue number system (RNS) has emerged as a 
promising alternative number representation for the design of faster and low power multipliers owing to its merit to 
distribute a long integer multiplication into several shorter and independent modulo multiplications [8]— [1 1]. RNS has also 
been successfully employed to design fault tolerant digital circuits [12], [13]. 

To transcend the 3n-bit limit, moduli sets, £2™ 1- 2™ , 2™ + 1, 2 + 1}, 

{22" - 1,2",2 2 " -+- l} and {2- - 1,2\2" +1,2^1-1} with DR of 5n bits and 

{2" — 1,2™ + 1,2 l ./2 + 1} wkh DR of 6n bitS; haye been proposed recent iy [21], [30], [31]. 
consequently, a RSA cryptosystem with a conservative key-length of 5 12 bits could be implemented in RNS using either the 
5n-bit DR moduli set with n = 100 or the 6n -bit DR moduli set with n = 85 . For a ECC cryptosystem with a typical key- 
length of 256 bits, either the 5n-bit DR moduli set with n = 50 or the 6n -bit DR moduli set with n = 42 could be chosen. 

The delay of an integer multiplication in RNS domain based on the -bit DR moduli set of [31] for example, is 
governed by the delay of the modulo 2 2n +l multiplier. As the time complexity of partial product summation by a carry save 
adder (CSA) tree and a two-operand parallel-prefix adder is a logarithmic function of , the critical path delay can be modeled 
as 0(log2n), but the delays of the modulo 2™-l and modulo 2™ + l multipliers are only 0(log2n) . This speedup of around 
1/(1 ~+~ losa "«■!) >< 1O09S ijy mo( i u i 2 n -l and 2 n + l modulo multipliers over the critical path delay is of no 
consequence. As encryption and decryption in PKC involves repeated multiplications, the cumulative difference in the 
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critical and noncritical modulo multiplier delays will increase with the number of multiplications involved. For lightweight 

cryptographic applications, such as smartcards and radio frequency identification (RFID) tags, the considerations of power, 

size and cost are of paramount importance [32]. The complexity of implementing reliable cryptographic hardware can be 

reduced by an ingenious exploitation of this timing headroom in the design of RNS multiplier. 

This paper focuses on the design space exploration of arithmetic operation in one of the two special moduli, i.e., 

the modulo 2™-l multiplier design. The Montgomery modulo multiplication, while computing the modular product without 

trial division, is modulus-independent and incapable of exploiting number theoretic properties of modulo 2™-l arithmetic for 

combinational circuit simplification. The properties of modulo 2™-l arithmetic were most effectively exploited for the full 

adder based implementation of modulo multiplier in [36]-[38]. In [36], the multiplier bits were not encoded, which lead to 

higher implementation area and longer partial product accumulation time. In [37] and [38], the radix-4 Booth encoding 

algorithm was employed to reduce the mber of partial products to [n/2]+l and [n/2] , respectively. The shorthand notations 

[a] and [a] denote the smallest integer greater than or equal to and the largest integer smaller than or equal to , respectively. 

With higher radix Booth encoding, the number of partial products is reduced by more than half and consequently, significant 

reduction in silicon area and power dissipation is feasible [39], [40]. The radix-8 Booth encoding reduces the number of 

partial products to [n/3] , which is more aggressive than the radix-4 Booth encoding. However, in the radix-8 Booth encoded 

modulo 2™-l multiplication, not all modulo -reduced partial products can be generated using the bitwise circular-left-shift 

_i_3 v~ 
operation and bitwise inversion. Particularly, the hard multiple ~~ "" 2" —J- is to be generated by an n-bit end-around- 

carry addition of X and 2X. The performance overhead due to the end-around-carry addition is by no means trivial and 

hence, the use of Booth encoding for modulo 2™-l multipliers have been restricted to only radix-4 in literature. 

The paper is organized as follows. Section II describes the radix-8 Booth encoding algorithm for modulo 2" — 1 

multiplication. A family of modulo 2" — 1 multipliers to adapt to different RNS delay is described in Section III. In Section 

IV, the criteria for selecting a suitable RCA word-length to achieve the desired performance are highlighted. The 

performance of the proposed family of modulo 2" — 1 multipliers is evaluated and compared against [38] in Section V. The 

paper is concluded in Section VI. The Appendix provides the derivation of the predetermined compensation constant for 

different valid combinations of the multiplier and RCA word-lengths. 
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. and Y = ^-^-i= O ■=" - represent the multiplicand and the multiplier of the modulo 
2" — 1 multiplier, respectively. The radix-8 Booth encoding algorithm can be viewed as a digit set conversion of four 
consecutive overlapping multiplier bits, ■i/3is-|-2f/3i-|-li/3i K.if3i 1} t a signed digit. The digit set conversion is 
formally expressed as 

di = y3,-i + Jfei + 27^+1 - 4y&+2 (1) 

where y_i = y n = y„ +1 = ^ 1+2 = [40]. 

TABLE I 
Modulo- Reduced Multiples for the R.adk-8 Booth Encoding 
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Table I summarizes the modulo-reduced multiples of X for all possible values of the radix-8 Booth encoded 
multiplier digit, d ( , where CLS[X,j] denotes a circular-left-shift of X by j - bit positions. 

Three unique properties of modulo 2™-l arithmetic that will be used for simplifying the combinatorial logic 
circuit of the proposed modulo multiplier design are reviewed here. 



1) Property 1: The modulo 2™-l reduction of -X can be implemented as the n-bit one's complementation of the 

binary word X as follows: 



"-^ 2 TC — 1 — — -A— A. 



(2) 



2) Property 2: For any nonnegative integers, the periodicity of an integer power of two over modulus 2™-l can be 

stated as follows [41]: 
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Property 2 ensures that the modulo 2™-l reduction of binary exponents can be implemented with no logic cost. 
As a corollary, the modulo 2™-l reduction of the product of a binary word X and an integer power of two,2 J , is 
equivalent to CLS[X,j] [14]. This property can be formally expressed as Property 3. 



3) 



Property 3: For j < n 



H-j'-l 



n-1 
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(4) 



In Table I, the modulo 2™-l reduction for G^are replaced by simple bitwise inversion and bitwise circular-left- 
shift of X using Properties 1 and 3, respectively. 
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Fig. 1. Generation of . I: 



using two _-bit RCAs. 



The above technique for +3Z 2n _iComputation involves two -bit carry-propagate additions in series such that the 
carry propagation length is twice the operand length, n . In the worst case, the late arrival of the + ~iX 2n -i may considerably 
delay all subsequent stages of the modulo 2™-l multiplier. Hence, this approach for hard multiple generation can no longer 
categorically ensure that the multiplication in the modulo 2™-l channel still falls in the noncritical path of a RNS multiplier. 

III. PROPOSED RADIX-8 BOOTH ENCODED MODULO MULTIPLIER DESIGN 

To ensure that the radix-8 Booth encoded modulo 2™-l multiplier does not constitute the system critical path of a 
high-DR moduli set based RNS multiplier, the carry propagation length in the hard multiple generation should not exceed n- 
bits. To this end, the carry propagation through the HAs in Fig. 1 can be eliminated by making the end-around-carry bit a 
partial product bit to be accumulated in the CSA tree. This technique reduces the carry propagation length to n- bits by 
representing the hard multiple as a sum and a redundant end-around-carry bit pair. The resultant [n/3]+l end-around-carry 
bits C-j in the partial product matrix may lead to a marginal increase in the CSA tree depth and consequently, may aggravate 
the delay of the CSA tree. In which case, it is not sufficient to reduce the carry propagation length to merely n-bits using the 
above technique. 

Since the absolute difference between the noncritical modulo 2™-l multiplier delay and the system critical path 
delay depends on the degree of imbalance in the moduli word-length of a RNS, the delays cannot be equalized by arbitrarily 
fixing the carry propagation length to n-bits. Instead, we propose to accomplish the adaptive delay equalization by 
representing the hard multiple in a partially-redundant form [48] . 

A. Generation of Partially -Redundant Hard Multiple 

Let \X2n-1 an d 2X 2n _ 1 be added by a group of M(n/k)k-bit RCAs such that there is no carry propagation between 
the adders. Fig. 2 shows this addition for n= 8 and k = 4 , where the sum and carry-out bits from the RCA block are 
represented as S{ and C\ respectively. In Fig. 2, the carry-out of RCA 0,C® , is not propagated to the carry input C| of RCA 
1 but preserved as one of the partial product bits to be accumulated in the CSA tree. The binary weight of the carry-out of 
RCA 1 has, however, exceeded the maximum range of the modulus and has to be modulo reduced before it can be 
accumulated by the CSA tree. 
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Fig. 2, Generation of partially-redundant |-t'3.,\| 2Tl _ 1 using t-bit RCAs, 
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By Property 2, the binary weight of C| can be reduced from 2 8 to 2° . Thus, C| is inserted at the least 
significant bit (lsb) position in Fig. 2. It should be stressed that the carry-out C| is a partial carry propagated through only 
most significant FAs and hence, is different from the end-around-carry bit in the modulo 2™-l addition of X and 2X i.e.,C 7 
of Fig. 1 
From Fig. 2, the partially-redundant form of \ + ~iX 2n -\ is given by the partial-sum and partial-carry pair (S,C)where 
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Since modulo 2™-l negation is equivalent to bitwise complementation by Property 1, the negative hard multiple in 
a partially- redundant form, 1 — ~iXl 2n -i = (S,C) , is computed as follows: 
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Fig. 4. Generation of partially-redundant simple multiples. 
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Fig. 6. Modulo-reduced partial product generation. 
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B. Generation of Partially -Redundant Simple Multiples 

The proposed technique represents the hard multiple in a biased partially-redundant form. Since the occurrences of 
the hard multiple cannot be predicted at design time, all multiples must be uniformly represented. Similar to the hard 
multiple, all other Booth encoded multiples listed in Table I must also be biased and generated in a partially-redundant form. 
Fig. 4 shows the biased simple multiples, represented in a partially-redundant form for . From Fig. 4, it can be seen that the 
generation of these biased multiples involves only shift and selective complementation of the multiplicand bits without 
additional hardware overhead. 

C. Radix-8 Booth Encoded Modulo 2"-l Multiplication With Partially-Redundant Partial Products 

The i-th partial product of a radix-8 Booth encoded modulo 2 n -l multiplier is given by 



PPi = 2 



3t 



di • X 



2"-l" 



(12) 



To include the bias B necessary for partially-redundant representation of PPi, (12) is modified to 



\di 



PP< = 2 it {B^d i -X)^_ y 



(13) 



Using Property 3, the modulo 2™-l multiplication by in (13) is efficiently implemented as bitwise circular-left- 
shift of the iased multiple,(B+di.X) . For n=8 and k= 4, Fig. 5 illustrates the partial product matrix of X.Y2 — 1 with 
([n/3]+l) partial products in partially-redundant representation. Each consists of an -bit vector, and a vector of redundant 
carry bits, and . Since and are the carry-out bits of the RCAs, they are displaced by -bit positions for a given . The bits, is 
displaced circularly to the left of by 3 bits, i.e., and are displaced circularly to the left of and by 3 bits, respectively and and 
are in turn displaced to the left of and by 3 bits, respectively. The last partial product in Fig. 5 is the Compensation Constant 
(CC) for the bias introduced in the partially- redundant representation. The derivation of this constant is detailed in Section 
IV and the Appendix. 
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Fig. 6. (a) Bit-slice of Booth Encoder (BE), (b) Bit-slice of Booth Selector (BS) 



The generation of the modulo-reduced partial products,PPO , PPI and PP2, in a partially-redundant representation 
using Booth Encoder (BE) and Booth Selector (BS) blocks are illustrated in Fig. 5. The BE block produces a signed one-hot 
encoded digit from adjacent overlapping multiplier bits as illustrated in Fig. 6(a). The signed one-hot encoded digit is then 
used to select the correct multiple to generate PPi . A bit-slice of the radix-8 BS for the partial product bit, is shown in Fig. 
6(b). 

As the bit positions of do not overlap, as shown in Fig. 4, they can be merged into a single partial product for 
accumulation. The merged partial products, and the constant CC are accumulated using a CSA tree with end-around-carry 
addition at each CSA level and a final two-operand modulo 2"-l adder as shown in Fig. 7. 

IV. SELECTION OF K 

The guidelines for choosing the RCA word-length, k, to achieve the desired performance are presented in this 
section. 
Firstly, irrespective of the targeted delay, the choice of must satisfy the following two criteria. 

1) Criterion 1: As the residues of modulus 2"-l are represented using only bits, it is imperative that divides is a trivial case 
and is excluded from this consideration. This criterion is expressed. 



2) Criterion 2: Since each partial product in radix-8 Booth encoding is shifted by three bits relative to the previous partial 
product, must not be a multiple of three to ensure that the bits are nonoverlapping. 
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In the proposed modulo 2" -\ multiplier, each partial product is incremented by a bias of as expressed in (13). To 
negate the effect of the bias, a constant CC is added and the value of CC is given by 

L:n/3j 

CC= - ^ B-2^ 

-i=o 



2™-l 



(14) 



where B is an -bit binary word consisting of logic one at bit position 2kj , and logic zero at all other positions as defined in 
(7). 

It is evident that the value of CC depends only on and . As CC is considered as one or more partial products to be 
summed in the CSA tree, the choice of indirectly determines the regularity of the multiplier design and consequently its 
efficiency in VLSI implementation. A detailed analysis on the computation of CC for various combinations of n and k is 
presented in the Appendix. For any that satisfies Criteria 1 and 2, it is shown that CC can be simplified by the properties of 
modulo 2" -1 arithmetic and precomputed at design time. The resultant CC is shown to be a single binary word with a 
specific repetitive pattern of logic ones and zeros. As the generation of CC involves merely the assignment of logic constants 
to appropriate bit positions, it can be directly hardwired into the CSA tree as a constant partial product without any logic 
circuitry. 

The effect of k on the delay of the constituent components of a radix-8 Booth encoded modulo 2" -1 multiplier is 
analyzed qualitatively and summarized in Table II. As indicated in Table II, the partial products and CC can be generated in 
constant time. Similarly the delay of the final two-operand parallel- prefix modulo 2" -1 adder is independent of . From 
Table II, by reducing , the delay of the RCA reduces linearly but the delay of the CSA tree stage increases only 
logarithmically. Hence, the delay of the modulo 2" -1 multiplier is logarithmically dependent on n and almost linearly 
dependent on . For a given , the modulo 2 n -\ multiplier delay can be manipulated by varying the word-length of the RCA, . 
In the following section, we show by means of synthesis results how the modulo multiplier delay can be matched to the RNS 
delay to save silicon area and reduce power dissipation. 



V. PERFORMANCE COMPARISON 

In this section, we evaluate the performance of the proposed family of partially-redundant modulo 2 n -1 multipliers 
with different suitably chosen RCA word-length, k . The proposed multipliers are also compared against the recent radix-4 
Booth encoded modulo 2 n -\ multiplier [38]. 

For experimental analysis, k is selected as n , n/2 and n/4 to satisfy Criteria 1 and 2 when n is not divisible by three. When n 
is divisible by three but not by higher powers of three, k is selected as n/3 and n/6. As M partial-carry bits are introduced by 
each partially-redundant partial product, the number of additional partial products resulting from merging the 
nonoverlapping bits is given by Q . 
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TABLE III 
Synthesis Results When n is Not Divisible by Three 
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TABLE IV 
Synthesis Results When n is Divisible by Three 
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Area (jam" 1 ) 


Delay (ns) 


12 
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4.90 
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4.87 


24 


37302 
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9.78 
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The proposed modulo 2" -\ multiplier designs for various feasible combinations of and were specified in VHDL, 
synthesized using Synopsys Design Compiler (V2004.06-SP2) and mapped to TSMC 0.18 1.8 V CMOS standard-cell 
library. The designs were synthesized under nominal synthesis design environment, i.,e., 25 C and 1.8 V initially before the 
timing constraint from the RNS is imposed. The area and delay synthesis results are shown in Table III for that is not 
divisible by three and in Table IV for n that is divisible by three. 

As the dynamic power dissipation of a combinational circuit is dependent on the input pattern, a Monte Carlo 
simulation method [49], [50] using a finite number of randomly generated test patterns is adopted to estimate the average 
power dissipation with 99.9% confidence that the error is bounded below 3% for a data rate of 20 Msamples/s. The average 
dynamic power and the leakage power are listed in Tables V and VI for n not divisible and divisible by three, respectively. 

VI. CONCLUSION 

A family of low-area and low-power modulo 2"-l multipliers with variable delay to achieve delay balance 
amongst individual modulo channels in a high-DR RNS multiplier was proposed. The delay of the proposed multiplier is 
controlled by the word-length of the small parallel RCAs that are used to compute the requisite hard multiple of the radix-8 
Booth encoded multiplication in a partially-redundant form. The trade-offs between the RCA word-length and the VLSI 
performance metrics, .i.e, area, delay and power dissipation of the modulo 2"-l multiplier were analyzed by means of 
CMOS implementations. For maximal area and power savings, n when n is not divisible by three and n/3 when n is 
divisible by three, were recommended for the RCA word-length when the RNS multiplier delay exceeded the noncritical 
modulo 2"-l multiplier delay substantially. Conversely, when the RNS multiplier and the modulo 2"-l multiplier delays 
were nearly balanced, RCA word-lengths of n/4 and n/6 were recommended when is not divisible and divisible by three, 
respectively. From synthesis results constrained by the critical channel delay of the RNS, it was shown that the proposed 
multiplier simultaneously reduces the 

area as well as the power dissipation of the radix-4 Booth encoded multiplier for n > 28 , which is the useful 
dynamic range of RNS multiplication to meet the minimum key-size requirements of ECC and RSA algorithms. 
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